The Ex Project: Web Information Extraction Using Extraction Ontologies

نویسندگان

Martin Labský

Vojtech Svátek

Marek Nekvasil

Dusan Rak

چکیده

Extraction ontologies represent a novel paradigm in web information extraction (as one of ‘deductive’ species of web mining) allowing to swiftly proceed from initial domain modelling to running a functional prototype, without the necessity of collecting and labelling large amounts of training examples. Bottlenecks in this approach are however the tedium of developing an extraction ontology adequately covering the semantic scope of web data to be processed and the difficulty of combining the ontology-based approach with inductive or wrapper-based approaches. We report on an ongoing project aiming at developing a web information extraction tool based on richly-structured extraction ontologies and with additional possibility of (1) semi-automatically constructing these from third-party domain ontologies, (2) absorbing the results of inductive learning for subtasks where pre-labelled data abound, and (3) actively exploiting formatting regularities in the wrapper style. Martin Labský Dept. of Information and Knowledge Engineering, University of Economics, Prague, W. Churchill Sq. 4, 130 67 Praha 3, Czech Republic e-mail: [email protected] Vojtěch Svátek Dept. of Information and Knowledge Engineering, University of Economics, Prague, W. Churchill Sq. 4, 130 67 Praha 3, Czech Republic e-mail: [email protected] Marek Nekvasil Dept. of Information and Knowledge Engineering, University of Economics, Prague, W. Churchill Sq. 4, 130 67 Praha 3, Czech Republic e-mail: [email protected] Dušan Rak Dept. of Information and Knowledge Engineering, University of Economics, Prague, W. Churchill Sq. 4, 130 67 Praha 3, Czech Republic e-mail: [email protected]

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...

متن کامل

Ontology-Based Information Extraction

Information Extraction (IE) aims to retrieve certain types of information from natural language text by processing them automatically. Ontology-Based Information Extraction (OBIE) has recently emerged as a subfield of Information Extraction. Here, ontologies which provide formal and explicit specifications of conceptualizations play a crucial role in the information extraction process. Because ...

متن کامل

Combining Multiple Sources of Evidence in Web Information Extraction

Extraction of meaningful content from collections of web pages with unknown structure is a challenging task, which can only be successfully accomplished by exploiting multiple heterogeneous resources. In the Ex information extraction tool, so-called extraction ontologies are used by human designers to specify the domain semantics, to manually provide extraction evidence, as well as to define ex...

متن کامل

Use of Ontologies for Cross-lingual Information Management in the Web

We present the ontology-based approach for crosslingual information management of web content that has been developed by the EC-funded project CROSSMARC. CROSSMARC can be perceived as a meta-search engine, which identifies domainspecific information from the Web. To achieve this, it employs agents for web crawling, spidering, information extraction from web pages, data storage, and data present...

متن کامل

Types and Roles of Ontologies in Web Information Extraction

We discuss the diverse types and roles of ontologies in web information extraction and illustrate them on a small study from the product offer domain. Attention is mainly paid to the impact of domain ontologies, presentation ontologies and terminological taxonomies.

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2009

The Ex Project: Web Information Extraction Using Extraction Ontologies

نویسندگان

چکیده

منابع مشابه

Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Ontology-Based Information Extraction

Combining Multiple Sources of Evidence in Web Information Extraction

Use of Ontologies for Cross-lingual Information Management in the Web

Types and Roles of Ontologies in Web Information Extraction

عنوان ژورنال:

اشتراک گذاری